ABSTRACT

In a Spiking Neural Network (SNN), spike emissions are sparsely and
irregularly distributed both in time and across the network
architecture. Since a common feature of SNNs is a low average
activity, efficient implementations of SNNs are usually based on
Event-Driven Simulation (EDS). On the other hand, simulations of
large scale neural networks can take advantage of distributing the
neurons over a set of processors (either a workstation cluster or a
parallel computer). This article presents DAMNED, a large scale SNN
simulation framework able to gather the benefits of EDS and parallel
computing. Two levels of parallelism are combined: distributed
mapping of the neural topology, at the network level, and local
multithreaded allocation of resources for simultaneous processing of
events, at the neuron level. Based on the causality of events, a
distributed solution is proposed for solving the complex problem of
scheduling without a synchronization barrier.

1 Introduction

Advancing the knowledge of cognitive functions, simulations of
Spiking Neural Networks (SNNs) represent a bridge between theoretical
models and experimental measurements in neuroscience. Unlike usual
threshold or sigmoid neurons, models of spiking neurons take into
account the precise times of spike emissions. Therefore, SNNs help to
simulate biologically plausible interactions between neurons and to
study the influence of local parameters, at the neuron level, on the
global behavior of the network, at the functional level. Results of
very large scale SNN simulations can be analyzed in the same way as
experiments on animals or humans, e.g. LFP [1] or EEG [2] recordings,
thus helping to understand how the brain works [3][4]. From a
complementary point of view, theoretical studies [5][6] raise high
hopes for the computational power of SNNs. Subject to the discovery
of convenient learning rules [7][8], simulations of large scale SNNs
would provide efficient new solutions for many applications such as
computer vision [9], adaptive control, real-time systems or
autonomous robotics.

For developing very large scale SNNs, supporting a wide variety of
spiking neuron models, a general purpose and fast running simulation
framework is necessary. The well known GENESIS [10] and NEURON [11]
are good for simulating precise biophysical models of neurons, but
since they are based on time-driven simulation, i.e. they scan all
the neurons and synapses of the network at each time step, they are
not specifically designed for fast simulation of very large scale
SNNs. In accordance with biological observations, the neurons of an
SNN are sparsely and irregularly connected in space (network
topology), and the variability of spike flows implies that they
communicate irregularly in time (network dynamics) with a low average
activity. Since the activity of an SNN can be fully described by
emissions of dated spikes from presynaptic neurons towards
postsynaptic neurons, an Event-Driven Simulation (EDS) is clearly
suitable for sequential simulations of spiking neural networks
[12][13][14][15][16]. More generally, event-driven approaches
substantially reduce the computational load of simulators that
control exchanges of dated events between event-driven cells (EC),
without checking each cell at each time step. On the other hand,
since parallelism is an inherent feature of neural processing in the
brain, simulations of large scale neural networks could take
advantage of parallel computing [17][18] or hardware implementations
[19][20][21]. A few studies have coupled parallel computing and EDS,
for general purpose systems [22] and for SNN simulation [23][24][25].

Our simulator belongs to the latter family and is close to
Grassmann's work [23][25], with some additional characteristics.
Although it has long been known [26] that a fine grain mapping of the
network (e.g. one neuron per processor) is dramatically inefficient,
due to high communication overhead, we think that a multithreaded
implementation of neurons as event-driven cells (EC) is efficient.
Unlike several simulators [?][24][25], we avoid the implementation of
a unique controller or farmer processor for scheduling the network
simulation. We propose to direct the timing of execution through the
times of events, without an explicit synchronization barrier. Hence
we propose DAMNED, a “Distributed And Multithreaded Neural Event
Driven” simulation framework that gathers the benefits of EDS and
distributed computing, and combines two levels of parallelism
(multiprocessor and multithread) for taking full advantage of the
specific features of SNNs, whatever the models of spiking neurons to
be implemented. Designed for efficient simulations either on a
workstation cluster or on a parallel computer, the simulator is
written in the object-oriented language C++, with the help of the MPI
library to handle communications between distant processors.
Section 2 develops the specificity of temporal events in SNNs.
Section 3 defines the distributed multiprocessor architecture and
specifies the role of the multithreaded processes. Section 4 details
the algorithms and addresses the synchronization problem. In
Section 5 we conclude with an outlook on DAMNED exploitation.


2 Temporal events in SNNs 


In a typical neural network, at any time, each neuron can receive on
its dendritic tree some signals emitted by other neurons. An incoming
signal arrives with a delay d_ij and is weighted by a synaptic
strength w_ij before being processed by the soma. The values of d_ij
and w_ij are specific to a given connection, from a presynaptic
neuron N_i to a postsynaptic neuron N_j. The membrane potential of a
neuron varies as a function of time and of incoming signals. The
neuron emits a spike, i.e. an outgoing signal on its axon, whenever
its membrane potential exceeds a given threshold θ. In experimental
settings, and thus in simulations, firing times are measured with
some resolution Δt, yielding a discrete time representation. Hence
each spike can be considered as an event, with a time stamp, and each
neuron can be considered as an event-driven cell EC_j, able to
forecast its next spike emission time as a result of the integration
of incoming spikes.

Figure 1: Variations of the membrane potential of a postsynaptic
neuron N_j. Three successive incoming excitatory spikes lead to the
forecast of an outgoing spike that must be cancelled afterwards, due
to a further incoming inhibitory spike with a smaller delay d_i3j.

However, the way to compute the future time of spike emission can be
complex, depending on the neuron model. For instance, if EC_j manages
the incoming delays, a further incoming spike, through an inhibitory
synapse, can cancel the forecast of an outgoing spike before the time
stamp associated with this event (see Figure 1). Hence we have to
address the delayed firing problem (see [23][14][?]).

Since the activity in the network is unpredictable, and since the
temporal order of events must be preserved for the sake of biological
plausibility, we ought to control the uncertainty of spike
prediction. In the context of C++ programming, we have chosen the
following data structures for the classes of “event objects”:


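CM event (resulting from ComMunication)
= incoming spike, to be computed

    +------------------+------------------+------------------+
    | label of         | label of         | time stamp of    |
    | target neuron    | source neuron    | spike emission   |
    +------------------+------------------+------------------+
    | N_j (integer)    | N_i (integer)    | st_i (integer)   |
    +------------------+------------------+------------------+

CP event (resulting from ComPutation)
= outgoing spike, to be emitted

    +------------------+------------------+------------------+
    | label of         | time stamp of    | certification    |
    | source neuron    | spike emission   | flag             |
    +------------------+------------------+------------------+
    | N_i (integer)    | st_i (integer)   | crt (boolean)    |
    +------------------+------------------+------------------+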


where crt is true only if the local run time, on the processor
implementing the neuron N_i, has advanced far enough to guarantee
that no further incoming spike could ever cancel the CP event (see
Section 4 for further details).
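As an illustration only, the two event classes could be declared as
follows. This is a minimal sketch, not the DAMNED source code: the
type names, the field names and the helper toCM are assumptions
introduced here; only the meaning of the fields comes from the
description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; only the meaning of each field comes from the text.
using NeuronId = std::int32_t;   // neuron label (integer)
using TimeStamp = std::int32_t;  // discrete time, in units of the resolution

// CM event: an incoming spike waiting to be computed by its target neuron.
struct CMEvent {
    NeuronId target;  // N_j
    NeuronId source;  // N_i
    TimeStamp st;     // st_i, emission time of the spike
};

// CP event: an outgoing spike forecast by a neuron, waiting to be emitted.
struct CPEvent {
    NeuronId source;  // N_i
    TimeStamp st;     // st_i, forecast emission time
    bool crt;         // certification flag
};

// Build the CM event delivered to one target when a CP event is emitted.
inline CMEvent toCM(const CPEvent& cp, NeuronId target) {
    assert(cp.st >= 0);  // assumption: time stamps are non-negative
    return CMEvent{target, cp.source, cp.st};
}
```

The certification flag is not copied into the CM event, since a spike
that has actually been emitted no longer needs it.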

Each class of “event-driven cell” (EC) objects is in charge of the
computation methods associated with a model of spiking neuron, e.g.
Integrate-and-Fire (IF, LIF), Spike Response Model (SRM), or others
(see [27][28]), so that different neuron types can be modeled in a
heterogeneous SNN. Classically, an EC_j object modeling a neuron N_j
has among its attributes the firing threshold θ_j of the neuron, the
synaptic weights w_ij and the delays d_ij of the connections from all
the presynaptic neurons N_i able to emit a spike towards N_j.
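The attributes listed above suggest a class hierarchy such as the one
sketched below. It is only a toy example under strong assumptions:
the class names, the receive method and the non-leaky
integrate-and-fire rule (sum the weights, fire on exceeding the
threshold, reset) are hypothetical, and the sketch deliberately
ignores both the ordering of arrival times and the delayed firing
problem.

```cpp
#include <cassert>
#include <map>

// One incoming connection of a neuron N_j.
struct Synapse {
    double weight;  // w_ij
    int delay;      // d_ij, in time steps
};

// Hypothetical base class for an event-driven cell (EC).
class EventDrivenCell {
public:
    explicit EventDrivenCell(double threshold) : theta_(threshold) {}
    virtual ~EventDrivenCell() {}

    void connect(int presynaptic, double weight, int delay) {
        assert(delay >= 0);
        inputs_[presynaptic] = Synapse{weight, delay};
    }

    // Process a spike emitted by `source` at time `st`. Returns true and
    // sets `spikeTime` if the cell forecasts an outgoing spike.
    virtual bool receive(int source, int st, int& spikeTime) = 0;

protected:
    double theta_;                   // firing threshold
    std::map<int, Synapse> inputs_;  // keyed by presynaptic neuron label
};

// Toy non-leaky integrate-and-fire model (illustrative only).
class ToyIF : public EventDrivenCell {
public:
    explicit ToyIF(double threshold) : EventDrivenCell(threshold), v_(0.0) {}

    bool receive(int source, int st, int& spikeTime) override {
        std::map<int, Synapse>::const_iterator it = inputs_.find(source);
        if (it == inputs_.end()) return false;  // not a presynaptic neuron
        v_ += it->second.weight;
        if (v_ <= theta_) return false;         // threshold not exceeded
        v_ = 0.0;                               // reset after firing
        spikeTime = st + it->second.delay;      // arrival time of the spike
        return true;
    }

private:
    double v_;  // membrane potential
};
```

Other models (LIF, SRM) would derive from the same base class and
override receive, which is what allows heterogeneous networks.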

3 Distributed architecture and threads

The neural network topology must be distributed over P processors
according to a static mapping, to be defined as convenient for the
application to be simulated. Each processor Pr_p implements a certain
number of EC objects, labelled by their neuron number N_i. Each
processor Pr_p runs simultaneously two main threads, called CMC and
CPC, for “ComMunication Controller” and “ComPutation Controller”
respectively, and as many extra threads as there are simultaneously
computing neurons (see Figure 2). Incoming spikes intended to be
computed by the neurons N_j belonging to processor Pr_p are stored in
a priority queue of CM events, ordered by their spike time stamp.
Outgoing spikes resulting from computations of neurons N_i belonging
to processor Pr_p are stored in a priority queue of CP events,
ordered by their spike time stamp. They are intended to be sent by
the CMC process to all the target neurons N_j of N_i, whether they
belong to Pr_p or to another processor. Processor Pr_p knows the
tables of postsynaptic neurons (neuron numbers N_j and number m of
the processor Pr_m implementing N_j) for all its neurons N_i. For
local target neurons, i.e. N_j ∈ Pr_p, a CP event [N_i, st_i, crt]
from the CPC queue generates CM events [N_j, N_i, st_i] in the CMC
queue of the same processor. For distant target neurons, each CM
event [N_k, N_i, st_i] is packed into a message to be sent to the
processor Pr_m implementing N_k.
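The dispatching of an emitted spike can be pictured with the
following sketch. The Target table entry, the Dispatch result type
and the function name dispatch are assumptions made for the example;
the real message packing relies on MPI and is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct CMEvent { int target; int source; int st; };
struct CPEvent { int source; int st; bool crt; };

// One entry of the postsynaptic table of N_i: a target neuron N_j and
// the number m of the processor Pr_m implementing it.
struct Target { int neuron; int processor; };

// Result of dispatching one emitted spike from processor `self`.
struct Dispatch {
    std::vector<CMEvent> local;                    // for the local CMC queue
    std::map<int, std::vector<CMEvent> > packets;  // one packet per distant Pr_m
};

Dispatch dispatch(const CPEvent& cp, const std::vector<Target>& targets,
                  int self) {
    assert(self >= 0);
    Dispatch out;
    for (std::size_t k = 0; k < targets.size(); ++k) {
        CMEvent cm = {targets[k].neuron, cp.source, cp.st};
        if (targets[k].processor == self) {
            out.local.push_back(cm);   // N_j belongs to Pr_p
        } else {
            out.packets[targets[k].processor].push_back(cm);
        }
    }
    return out;
}
```

Grouping distant events by destination processor keeps one packet per
target processor, which matches the idea of sending packets rather
than individual spikes.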

As illustrated by Figure 2, each processor runs two main threads in
parallel, CMC and CPC, with mutual exclusion for accessing each
other's priority queue. The CMC and CPC threads each run an infinite
loop over the following procedures:

CMC, ComMunication Controller

1. message reception: if messages from other processors are
available, then place all the received CM events [N_j, N_i, st_i]
inside the CMC priority queue, ordered by their time stamp st_i (or
by arrival time, if equal time stamps exist),

2. emission control: if the next outgoing spike [N_i, st_i, crt], at
the top of the CPC queue, is authorized, then look at the table of
target neurons of N_i, create the CM events [N_j, N_i, st_i] for all
the postsynaptic neurons N_j and place them either in the local CMC
queue, if N_j ∈ Pr_p, or in packets prepared for further message
sending,

3. message sending: if packets are ready, then send them to the
target processors.

All messages are sent and received according to MPI communication
protocols. The receive and send procedures do not stall waiting for
actual messages at each loop step. They only check whether messages
are present in their communication buffers. They process them if
relevant, otherwise the loop goes on.

Figure 2: Architecture of a processor Pr_p, for p ≠ 0. Several
threads run simultaneously: a CMC thread, a CPC thread and as many
threads as currently computing neurons. CMC receives spike events
coming from other processors (step CMC 1). CMC inserts the incoming
events in the CM priority queue according to their time stamp st_i.
CPC checks whether the top CM event is authorized for computation
(step CPC 1). If authorization is granted, the thread associated to
EC_j processes the [N_j, N_i, st_i] CM event. If N_j triggers, the
resulting spike generates a new CP event that is inserted in the CPC
priority queue (step CPC 2). CMC checks whether the top CP event is
authorized for emission. If so, the spike event is dispatched towards
all the target neurons, generating CM events that are inserted either
in the CMC queue or in packets to be sent to other processors.

CPC, ComPutation Controller

1. computation starter: if the next incoming spike [N_j, N_i, st_i],
at the top of the CMC queue, is authorized, then launch the thread
associated to EC_j that implements neuron N_j,

2. result collector: if a new spike [N_j, st_j, crt] has been
generated by EC_j, then place the event inside the CPC priority
queue, ordered by its time stamp st_j (or, by default, by arrival
time if some other time stamps are equal).
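Both queues use the same ordering rule: time stamp first, arrival
order in case of equal time stamps. One possible way to obtain it,
sketched here with the standard std::priority_queue, is to tag each
event with a sequence number at insertion. The names Queued, Later
and CMQueue are assumptions, and the mutual exclusion between the CMC
and CPC threads is deliberately left out of the sketch.

```cpp
#include <cassert>
#include <queue>
#include <vector>

struct CMEvent { int target; int source; int st; };

// Queue entry: the event plus a sequence number recording arrival order.
struct Queued {
    CMEvent ev;
    unsigned long seq;
};

// std::priority_queue serves the "largest" element first, so the
// comparison is reversed: an entry compares lower when it must be
// served earlier (smaller time stamp, then smaller sequence number).
struct Later {
    bool operator()(const Queued& a, const Queued& b) const {
        if (a.ev.st != b.ev.st) return a.ev.st > b.ev.st;
        return a.seq > b.seq;
    }
};

class CMQueue {
public:
    CMQueue() : next_(0) {}

    void push(const CMEvent& ev) {
        Queued q;
        q.ev = ev;
        q.seq = next_++;
        heap_.push(q);
    }
    bool empty() const { return heap_.empty(); }
    const CMEvent& top() const {
        assert(!heap_.empty());
        return heap_.top().ev;
    }
    void pop() {
        assert(!heap_.empty());
        heap_.pop();
    }

private:
    std::priority_queue<Queued, std::vector<Queued>, Later> heap_;
    unsigned long next_;  // next sequence number to assign
};
```

A queue for CP events would follow the same pattern with the CP event
type in place of CMEvent.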

Each time an incoming spike is computed by a neuron N_j, the
associated thread is activated. Since the CPC runs an infinite loop
on the computation starter and result collector procedures, several
other threads, on EC objects, can be active simultaneously, thus
implementing concurrent computation of several neurons, locally on
processor Pr_p (Figure 2). On each processor, nbth_p represents the
number of active threads on Pr_p. The variable nbth_p is incremented
by the computation starter procedure each time a thread is activated
for an EC object, and decremented by the result collector procedure
each time a thread ends. The EC object keeps a pointer to every CP
event it has generated, as long as the event is present in the CPC
queue. Hence the EC object can modify some certification flags if new
information allows it to certify some previously queued events that
have not yet been emitted. For a complete explanation of how to
manage the delayed firing problem, let us define two other local
variables:

et_p is the current emission time on processor Pr_p

pt_p is the current processing time on processor Pr_p

The variable et_p switches to its opposite, negative value when the
CPC queue (spikes to be emitted) becomes empty, and switches back to
the opposite, positive value when new events arrive in the CPC queue.
The variable pt_p behaves in the same way according to the state of
the CMC queue (spikes to be processed). These two variables play a
fundamental role in controlling the scheduling on the set of
processors and in defining the conditions of emission and computation
authorizations, as detailed in Section 4.
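The sign convention can be made explicit with a small helper. This is
a hypothetical sketch (the SignedTime type and its method names are
not part of DAMNED): the magnitude stores the time reached, and a
negative sign marks an empty queue. Note that with this encoding an
empty queue cannot be distinguished at time 0, since 0 has no
negative counterpart.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical wrapper for et_p or pt_p.
struct SignedTime {
    long value;  // |value| = time reached, value < 0 = associated queue empty

    SignedTime() : value(0) {}

    long time() const { return std::labs(value); }
    bool queueEmpty() const { return value < 0; }

    // The associated queue has just become empty.
    void onQueueEmptied() { value = -std::labs(value); }

    // New events have arrived in the associated queue.
    void onQueueRefilled() { value = std::labs(value); }

    // An event with time stamp t has been authorized.
    void advanceTo(long t) {
        assert(t >= time());  // the magnitude never decreases
        value = t;
    }
};
```

Keeping the magnitude when the sign flips means that other processors
still learn how far the local time has advanced, even while the queue
is empty.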

A last point about the distributed architecture: the neural network
topology is spread over P processors. However, a realistic simulation
requires an interaction with the environment. Hence an extra
processor Pr_0 is necessary to send the environment stimuli to input
neurons (distributed on several Pr_p, with p ≥ 1) and to receive the
response from output neurons (also distributed on several Pr_q, with
q ≥ 1). We prevent processor Pr_0 from being a controller or a
farmer, but the way it helps the scheduling of the whole simulation
is also fundamental, as explained in the next section.


4 Synchronization control methods 


The main point is to keep an exact computation of firing times for
all the neurons of the network, and to prevent deadlock situations on
all the processors, despite the possibly irregular distribution of
the firing dynamics that can result from the environment data. The
processor Pr_0 is in charge of sending to the neural network all the
input stimuli generated by the environment (e.g. the translation of
an input vector into temporal coding) and of receiving all the spikes
emitted by the output neurons of the network.

Pr_0 knows the actual time T of the environment, and its current
emission time et_0. At initial time, all the processor emission times
et_p are set to 0. While the simulation runs, each processor,
including Pr_0, may have a partial and obsolete view of the clocks of
the other processors. Each processor Pr_m, 0 ≤ m ≤ P, owns a local
clock array Clk(m) storing the emission times it currently knows, for
all the processors Pr_p, 0 ≤ p ≤ P:


    Clk(m) = [ et_0(m) | et_1(m) | ... | et_P(m) ]


Each time a processor Pr_p sends a packet of events (spike emissions)
to a processor Pr_m, the message is encapsulated with the local clock
Clk(p). Hence the clock Clk(m) can be updated each time Pr_m receives
a message. We claim that the scheduling of the whole network can be
achieved this way, thanks to the local management of event causality.
Since this way of controlling synchronization does not require
“look-ahead” query messages, we propose a more flexible method than
the “safe window” solution described in [?].


Environment processor

Processor Pr_0 is the only one that is not subject to the delayed
firing problem, since the environment processor relays all the input
stimuli towards the neural network. Hence Pr_0 knows exactly the
dates of all the external spikes that will trigger the input neurons
of the SNN. Each time Pr_0 increments the actual time to T, all the
packets with time stamps T - 1 are ready for immediate sending.
Messages are sent to all the processors Pr_m, m ≥ 1, with the
following information:

• the current update of the clock Clk(0), where et_0 has just been
  incremented to T

• if relevant, all the CM events [N_j, Ext, T - 1] that will generate
  spike emissions at time T on the input neurons N_j owned by
  processor Pr_m

Even if a processor Pr_m does not own input neurons, or if it owns
input neurons that do not trigger at time T, it will receive a
message with the clock Clk(0). Note that the environment processor
Pr_0 is the only one that can send messages reduced to the clock.
Since all the processors are aware of the last update of et_0
“immediately”, or as soon as the message can be transmitted (we
assume reliable communication channels), the argument (m) will be
omitted from the notation et_0(m) in the following.

The simulation starts running when T is incremented to 1. Since the
communication of spike events must respect a causal order, the
following conditions are always true, on every processor Pr_m
(arguments have been omitted):

    T > 0 and et_0 > 0 all along the run, after simulation start
    (∀p ≥ 1) et_0 ≥ |et_p|
    T and (∀p ≥ 0) |et_p| are never decreasing
    (∀p ≥ 0) |pt_p| is never decreasing
    (∀p ≥ 1) et_0 ≥ |pt_p|

The links between the emission time et_p and the processing time pt_p
are clarified below, where the algorithms that govern emission and
computation authorizations are detailed.

Each time Pr_0 receives an output event packet, it forwards all the
CM events [Ext, N_i, st_i] to the environment manager for further
external processing, and it updates its clock Clk(0) from the
received clock Clk(q), as follows:

    (∀p ≥ 1) if |et_p(q)| > |et_p(0)| then et_p(0) ← et_p(q);
    if (∃j ≥ 1) |et_j(q)| = T then T ← T + 1; et_0 ← T;

If T has been incremented, then Pr_0 sends the appropriate messages
to all the Pr_m. Note that Pr_0 may receive several output spike
events, coming from different processors, between two successive
increments of T. Conversely, it is possible that no output neuron
emits a spike at a given time T, in which case Pr_0 does not receive
any message and does not update its clock. Such a case would quickly
result in stalling all the processors, by blocking their emission and
computation authorizations. To prevent the system from deadlock, we
assume that a time-out is running on Pr_0 and that T is incremented
when the time-out expires, which is coherent with the notion of
actual time represented by T. Hence et_0 is updated to T and,
provided that the time-out is sufficiently long, all the et_p(0) can
be set to T - 1. Messages are sent to all the processors, with the
updated clock Clk(0) and possibly new spike events generated by
external stimuli. This time-out is rarely activated, but it prevents
the system from falling into deadlock when the dynamics of the SNN is
reduced to a very low overall activity, or to activity loops that may
be localized on a single processor or on a cluster with no output
spikes.
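The two ways in which Pr_0 advances the actual time T (reception of
an output packet, or expiry of the time-out) can be summarized by the
sketch below. The EnvClock type and its method names are assumptions,
and neither the timer itself nor the MPI messages are modelled; only
the clock arithmetic stated above is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical clock state of the environment processor Pr_0.
struct EnvClock {
    long T;                 // actual time of the environment
    std::vector<long> clk;  // Clk(0): clk[p] = et_p(0), clk[0] = et_0

    explicit EnvClock(std::size_t P) : T(0), clk(P + 1, 0) {}

    // T <- T + 1; et_0 <- T
    void advance() {
        ++T;
        clk[0] = T;
    }

    // A packet from some Pr_q arrived with its clock array Clk(q).
    // Returns true if T was incremented, i.e. if messages must be sent.
    bool onPacket(const std::vector<long>& received) {
        assert(received.size() == clk.size());
        bool reached = false;
        for (std::size_t p = 1; p < clk.size(); ++p) {
            if (std::labs(received[p]) > std::labs(clk[p])) clk[p] = received[p];
            if (std::labs(received[p]) == T) reached = true;
        }
        if (reached) advance();
        return reached;
    }

    // The time-out expired without any output message.
    void onTimeout() {
        advance();
        for (std::size_t p = 1; p < clk.size(); ++p) clk[p] = T - 1;
    }
};
```

In both cases the caller would then send Clk(0), together with any
pending input stimuli, to all the processors Pr_m.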

CMC algorithms for emission authorization 

On each processor Pr_p, for p ≥ 1, the CMC runs an infinite loop on
the successive procedures of message reception, emission control and
message sending (see Section 3).

At each message sending, processor Pr_p checks whether a packet of
the required size (the minimal packet size minpak is a parameter) is
ready to be sent to another processor Pr_m. If so, processor Pr_p
encapsulates the ready-to-be-sent packet with its current clock array
Clk(p) and sends it to the target processor. At each message
reception from a processor Pr_q, the CMC of Pr_p inserts the incoming
CM events in its priority queue, sets the processing time back to a
positive value, pt_p ← |pt_p|, and updates its local knowledge of the
et_m on the other processors as follows:

    (∀m ≠ p) if |et_m(q)| > |et_m(p)| then et_m(p) ← et_m(q);

The emission control procedure picks up a new CP event
[N_i, st_i, crt] from the top of the CPC priority queue. This event
has been computed by a local thread activated by the neuron N_i and
has generated a spike emission forecast for time st_i. In order to
respect the causality of events, the CMC process has to check the
authorization to communicate this event, by the following algorithm:

    if st_i = et_p then emission is authorized;
    else if crt then emission is authorized;
    else if st_i < pt_p then emission is authorized;
    else if nbth_p = 0 then
        if pt_p < 0 and (∀m ≠ p) [st_i < et_m or et_m < 0]
            then emission is authorized;
        else emission is delayed;
    else emission is delayed;

If the emission is authorized, the CMC process updates the local
emission time, et_p ← st_i, and the CP event is removed from the CPC
queue. Each time the CPC queue becomes empty, the local emission time
is changed to its opposite, et_p ← -et_p, in order to indicate that
there are no more spike emissions to communicate, at present time, on
processor Pr_p. If the emission authorization generates, from the
postsynaptic table of neuron N_i, new CM events to be further
processed by one or more neurons local to Pr_p, then the processing
time pt_p takes back a positive value: pt_p ← |pt_p|.
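The emission control test translates almost line by line into a
predicate. The sketch below is only an illustration under stated
assumptions: the LocalState type and the function name are
hypothetical, the local emission time et_p is read from the local
clock array at index p, and the strict comparisons are taken exactly
as they appear in the algorithm above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct CPEvent { int source; long st; bool crt; };

// Hypothetical view of the state of processor Pr_p used by the CMC.
struct LocalState {
    std::size_t p;          // index of this processor
    long pt;                // pt_p (negative when the CMC queue is empty)
    int nbth;               // nbth_p, number of active neuron threads
    std::vector<long> clk;  // Clk(p): clk[m] = et_m as known locally
};

// Returns true if the CP event at the top of the CPC queue may be emitted.
bool emissionAuthorized(const CPEvent& ev, const LocalState& s) {
    assert(s.p < s.clk.size());
    const long et = s.clk[s.p];            // local emission time et_p
    if (ev.st == et) return true;          // same time as the last emission
    if (ev.crt) return true;               // certified event
    if (ev.st < s.pt) return true;         // processing already went past st_i
    if (s.nbth != 0 || s.pt >= 0) return false;  // local work still pending
    // No active thread and empty CMC queue: check the other processors.
    for (std::size_t m = 0; m < s.clk.size(); ++m) {
        if (m == s.p) continue;
        if (!(ev.st < s.clk[m] || s.clk[m] < 0)) return false;
    }
    return true;
}
```

When the predicate returns false, the event simply stays at the top
of the CPC queue and is tested again at a later iteration of the CMC
loop.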

The present algorithm ensures that the authorization to be emitted
cannot be delivered to a spike event [N_i, st_i, crt] before its
validity has been assured, with regard to the overall run of the
simulated SNN. The emission of the spike event is authorized if we
are sure that no further computation of neuron N_i can invalidate the
present spike, either due to other computations locally running on
processor Pr_p (controls on et_p, pt_p and nbth_p) or to distant
spike events further incoming from other processors (controls on
et_m, for all m ≠ p). Even if a spike emission has been delayed only
because the local clock Clk(p) was not up to date, we avoid
overloading the communication network with query messages, since the
possible idle state is guaranteed to be ended by the reception of
either new incoming events from other processors or clock messages
coming from Pr_0.

CPC algorithms for computation authorization

On each processor Pr_p, the CPC runs an infinite loop on the
successive procedures computation starter and result collector (see
Section 3).

The computation starter procedure picks up the top CM event
[N_j, N_i, st_i] of the CMC priority queue. This event notifies that
the neuron N_i has emitted a spike at time st_i towards neuron N_j.
The CPC process is in charge of delivering the computation
authorization, according to the following algorithm:
    if the thread associated to EC_j is already active
        then { EC_j gets priority status [for further computation];
               computation is delayed; }
    else if st_i = pt_p then computation is authorized;
    else if nbth_p = 0 then
        if (∀m) [st_i < et_m or et_m < 0]
            then computation is authorized;
        else if local deadlock is detected then
            if st_i < st_k (st_k: time stamp of the next event at the
                            top of the CPC queue)
                then computation is authorized;
            else computation is delayed;

with the following condition for local deadlock detection:

    et_p < st_i and (∀m ≠ p) [st_i < et_m or et_m < 0]

If the computation is authorized, the CPC updates both the local
processing time, pt_p ← st_i, and the number of locally active
threads, nbth_p++. Each time the CMC queue becomes empty, the local
processing time is changed to its opposite: pt_p ← -pt_p.
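The computation starter test can be sketched in the same spirit. The
CpcState type, the helper clocksAllow and the function name are
assumptions. Two points are not settled by the algorithm as written
and are filled in conservatively here: a case that falls through
without an explicit outcome is treated as "delayed", and an empty CPC
queue in the deadlock branch also yields "delayed". The strict
comparisons are taken as they appear above, and the priority status
given to a busy EC_j is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical view of the state of processor Pr_p used by the CPC.
struct CpcState {
    std::size_t p;          // index of this processor
    long pt;                // pt_p, local processing time
    int nbth;               // nbth_p, number of active neuron threads
    std::vector<long> clk;  // Clk(p): clk[m] = et_m as known locally
    bool cpQueueEmpty;      // true if the CPC queue holds no event
    long cpTopSt;           // time stamp at the top of the CPC queue, if any
};

// True if [st < et_m or et_m < 0] holds for every m (optionally m != p).
static bool clocksAllow(long st, const CpcState& s, bool skipSelf) {
    for (std::size_t m = 0; m < s.clk.size(); ++m) {
        if (skipSelf && m == s.p) continue;
        if (!(st < s.clk[m] || s.clk[m] < 0)) return false;
    }
    return true;
}

// Returns true if the CM event stamped `st`, targeting a neuron whose
// thread is active or not (`threadActive`), may be computed now.
bool computationAuthorized(long st, bool threadActive, const CpcState& s) {
    assert(s.p < s.clk.size());
    if (threadActive) return false;       // one event at a time per neuron
    if (st == s.pt) return true;          // same time as current processing
    if (s.nbth != 0) return false;        // assumption: delayed otherwise
    if (clocksAllow(st, s, false)) return true;
    const bool localDeadlock = s.clk[s.p] < st && clocksAllow(st, s, true);
    if (localDeadlock && !s.cpQueueEmpty && st < s.cpTopSt) return true;
    return false;
}
```

On a true result the caller would set pt_p to st_i, increment nbth_p
and launch the thread of EC_j.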

The present algorithm authorizes the computation of only one event at
a time by a given neuron N_j (the computation is delayed if the
thread of EC_j is active) and regulates the computations, via the
variable pt_p, according to the advancement state of the whole
network, known by way of the clock Clk(p) (controls on all the et_m).
Once again, we avoid communication overhead, even if a computation is
delayed for a moment because of an obsolete clock, since the problem
will be solved by the further reception of messages coming from other
processors.

The result collector scans the active threads, first for an EC with
priority status (if relevant), or in a loop over the currently active
threads. If the computation of an event by a neuron N_j is over (i.e.
the thread has ended), then the number of active threads is
decremented: nbth_p--. The result of the computation is either null
or a new outgoing spike event [N_j, st_j, crt] that the result
collector inserts in the CPC queue. If the CPC queue was previously
empty, then the emission time et_p takes back a positive value:
et_p ← |et_p|.

Moreover, the computation of an event for a neuron N_j induces the
certification of old events [N_j, st_j, crt] still present in the CPC
queue: crt ← “true” each time st_j is less than or equal to the
currently processed st_i plus the minimal delay min_i(d_ij) of neuron
N_j.
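The certification rule can be illustrated as follows. The function
name certify and its signature are assumptions: the vector of
pointers stands for the pointers that an EC object keeps to its own
CP events still present in the CPC queue, and the delays are the
incoming delays d_ij of that neuron.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct CPEvent { int source; long st; bool crt; };

// Certify the pending CP events of one neuron N_j after it has processed
// an incoming spike stamped `processedSt`. Returns how many events were
// newly certified.
std::size_t certify(std::vector<CPEvent*>& pending, long processedSt,
                    const std::vector<long>& delays) {
    assert(!delays.empty());
    // Minimal incoming delay min_i(d_ij) of the neuron.
    const long minDelay = *std::min_element(delays.begin(), delays.end());
    std::size_t count = 0;
    for (std::size_t k = 0; k < pending.size(); ++k) {
        CPEvent* ev = pending[k];
        if (!ev->crt && ev->st <= processedSt + minDelay) {
            ev->crt = true;  // no later incoming spike can cancel it
            ++count;
        }
    }
    return count;
}
```

The intuition is that any spike processed later carries a time stamp
of at least st_i and needs at least the minimal delay to reach N_j,
so it cannot act before st_i + min_i(d_ij).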

5 Conclusion 

We have designed a framework dedicated to event-driven simulation of
very large neural networks of biologically plausible spiking neurons.
The DAMNED simulator is based on two levels of parallelism: at a
coarse grain level, the SNN is distributed over several processors;
at a fine grain level, local computations of neurons are
multithreaded on each processor. Since local clock updates, based on
event causality, are managed via the passing of spike event messages,
both a time-consuming synchronization barrier and a centralized
farmer processor can be avoided.

At present, the simulator has been successfully tested on a toy SNN,
with a basic model of spiking neuron. Further work will include the
implementation of large heterogeneous SNNs. Time measurements and
speed-up evaluations will be performed both on workstation clusters
and on parallel computers (e.g. at the IN2P3 and C3I computation
centers).